Abstract

Information extraction (IE) systems help
analysts assimilate information from
electronic  documents.  This  paper  focuses  on 
IE  tasks  designed  to  support  information 
discovery  applications.  Since  information 
discovery  implies  examining  large  volumes 
of  documents  drawn  from  various  sources  for 
situations  that  cannot  be  anticipated  a  priori, 
such applications require IE systems to have breadth as
well  as  depth.  This  implies  the  need  for  a 
domain-independent  IE  system  that  can 
easily  be  customized  for  specific  domains: 
end  users  must  be  given  tools  to  customize 
the  system  on  their  own.  It  also  implies  the 
need  for  defining  new  intermediate  level  IE 
tasks  that  are  richer  than  the 
subject-verb-object (SVO) triples produced
by  shallow  systems,  yet  not  as  complex  as  the 
domain-specific  scenarios  defined  by  the 
Message  Understanding  Conference  (MUC). 

This  paper  describes  a  robust,  scalable  IE 
engine  designed  for  such  purposes.  It 
describes new IE tasks, such as entity profiles
and concept-based general events, which
represent realistic goals in terms of what can
be accomplished in the near term while
providing useful, actionable information.
These  new  tasks  also  facilitate  the  correlation 
of  output  from  an  IE  engine  with  existing 
structured  data.  Benchmarking  results  for  the 
core  engine  and  applications  utilizing  the 
engine  are  presented. 

1  Introduction 

This  paper  focuses  on  new  intermediate  level 
information  extraction  tasks  that  are  defined  and 
implemented  in  an  IE  engine,  named  InfoXtract. 
InfoXtract is a domain-independent but portable
information  extraction  engine  that  has  been  designed 
for  information  discovery  applications. 


The  last  decade  has  seen  great  advances  in  the  area 
of  IE.  In  the  US,  MUC  [Chinchor  &  Marsh  1998]  has 
been  the  driving  force  for  developing  this  technology. 

The  most  successful  IE  task  thus  far  has  been 
Named  Entity  (NE)  tagging.  The  state-of-the-art 
exemplified  by  systems  such  as  NetOwl  [Krupka  & 
Hausman  1998],  IdentiFinder  [Miller  et  al  1998]  and 
InfoXtract  [Srihari  et  al  2000]  has  reached  near  human 
performance,  with  90%  or  above  F-measure.  On  the 
other  hand,  the  deep  level  MUC  IE  task  Scenario 
Template  (ST)  is  designed  to  extract  detailed 
information  for  predefined  event  scenarios  of  interest. 
It  involves  filling  the  slots  of  complicated  templates.  It 
is  generally  felt  that  this  task  is  too  ambitious  for 
commercial  application  at  present. 

Information  Discovery  (ID)  is  a  term  which  has 
traditionally  been  used  to  describe  efforts  in  data 
mining  [Han  1999].  The  goal  is  to  extract  novel 
patterns  of  transactions  which  may  reveal  interesting 
trends.  The  key  assumption  is  that  the  data  is  already 
in  a  structured  form.  ID  in  this  paper  is  defined  within 
the  context  of  unstructured  text  documents;  it  is  the 
ability  to  extract,  normalize/disambiguate,  merge  and 
link  entities,  relationships,  and  events  which  provides 
significant  support  for  ID  applications.  Furthermore, 
there  is  a  need  to  accumulate  information  across 
documents  about  entities  and  events.  Due  to  rapidly 
changing  events  in  the  real  world,  what  is  of  no 
interest  one  day,  may  be  especially  interesting  the 
following  day.  Thus,  information  discovery 
applications  demand  breadth  and  depth  in  IE 
technology. 

A  variety  of  IE  engines,  reflecting  various  goals  in 
terms of extraction as well as architectures, are now
available.  Among  these,  the  most  widely  used  are  the 
GATE  system  from  the  University  of  Sheffield 
[Cunningham  et  al  2003],  the  IE  components  from 
Clearforest  (www.clearforest.com),  SIFT  from  BBN 
[Miller  et  al  1998],  REES  from  SRA  [Aone  & 
Ramon- Santacruz  1998]  and  various  tools  provided 
by  Inxight  (www.inxight.com).  Of  these,  the  GATE 
system  most  closely  resembles  InfoXtract  in  terms  of 
its  goals  as  well  as  the  architecture  and  customization 
tools.  Cymfony  differentiates  itself  by  using  a  hybrid 
model that efficiently combines statistical and
grammar-based  approaches,  as  well  as  by  using  an 
internal  data  structure  known  as  a  token-list  that  can 
represent  hierarchical  linguistic  structures  and  IE 
results  for  multiple  modules  to  work  on. 

The  research  presented  here  focuses  on  a  new 
intermediate  level  of  information  extraction  which 
supports  information  discovery.  Specifically,  it 
defines  new  IE  tasks  such  as  Entity  Profile  (EP) 
extraction,  which  is  designed  to  accumulate 
interesting  information  about  an  entity  across 
documents  as  well  as  within  a  discourse.  Furthermore, 
Concept-based General Event (CGE) is defined as a
domain-independent representation of event
information that is more feasible than MUC ST.

InfoXtract  represents  a  hybrid  model  for  extracting 
both  shallow  and  intermediate  level  IE:  it  exploits 
both  statistical  and  grammar-based  paradigms.  A  key 
feature  is  the  ability  to  rapidly  customize  the  IE  engine 
for  a  specific  domain  and  application.  Information 
discovery  applications  are  required  to  process  an 
enormous  volume  of  documents,  and  hence  any  IE 
engine  must  be  able  to  scale  up  in  terms  of  processing 
speed  and  robustness;  the  design  and  architecture  of 
InfoXtract  reflect  this  need. 

In the remaining text, Section 2 defines the new
intermediate  level  IE  tasks.  Section  3  presents 
extensions  to  InfoXtract  to  support  cross-document 
IE.  Section  4  presents  the  hybrid  technology.  Section 
5  delves  into  the  engineering  architecture  and 
implementation  of  InfoXtract.  Section  6  discusses 
domain  porting.  Section  7  presents  two  applications 
which have exploited InfoXtract, and finally, Section
8  summarizes  the  research  contributions. 

2  InfoXtract:  Defining  New  IE  Tasks 

InfoXtract [Li & Srihari 2003, Srihari et al 2000] is a
domain-independent and domain-portable,
intermediate level IE engine. Figure 1 illustrates the
overall  architecture  of  the  engine. 

A  description  of  the  increasingly  sophisticated  IE 
outputs  from  the  InfoXtract  engine  is  given  below: 

• NE: Named Entity objects represent key items
such as proper names (person, organization,
product, location, target), contact information
(address, email, phone number, URL), and time
and numerical expressions (date, year, and
various measurements such as weight, money,
percentage, etc.).

• CE: Correlated Entity objects capture relationship
mentions between entities such as the
affiliation  relationship  between  a  person  and  his 
employer.  The  results  will  be  consolidated  into 
the  information  object  Entity  Profile  (EP)  based 
on  co-reference  and  alias  support. 


•  EP:  Entity  Profiles  are  complex  rich  information 
objects  that  collect  entity-centric  information,  in 
particular,  all  the  CE  relationships  that  a  given 
entity  is  involved  in  and  all  the  events  this  entity 
is  involved  in.  This  is  achieved  through 
document-internal fusion and cross-document
fusion  of  related  information  based  on  support 
from  co-reference,  including  alias  association. 
Work  is  in  progress  to  enhance  the  fusion  by 
correlating  the  extracted  information  with 
information  in  a  user-provided  existing  database. 

• GE: General Events are verb-centric information
objects  representing  ‘who  did  what  to  whom 
when  and  where’  at  the  logical  level. 
Concept-based  GE  (CGE)  further  requires  that 
participants  of  events  be  filled  by  EPs  instead  of 
NEs  and  that  other  values  of  the  GE  slots  (the 
action,  time  and  location)  be  disambiguated  and 
normalized. 

•  PE:  Predefined  Events  are  domain  specific  or 
user-defined  events  of  a  specific  event  type,  such 
as Product Launch and Company Acquisition in
the  business  domain.  They  represent  a  simplified 
version  of  MUC  ST.  InfoXtract  provides  a  toolkit 
that  allows  users  to  define  and  write  their  own 
PEs  based  on  automatically  generated  PE  rule 
templates. 
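
The relationship between these output objects can be pictured with a small data model. The following is only an illustrative sketch: the class names, field names and the sample person and company are hypothetical and are not taken from the InfoXtract implementation. It shows a CE relationship and a GE being collected into an EP.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NamedEntity:
    text: str      # surface string of the mention
    ne_type: str   # e.g. "PERSON", "ORGANIZATION", "LOCATION"


@dataclass
class CorrelatedEntity:
    relation: str          # e.g. "affiliation"
    source: NamedEntity
    target: NamedEntity


@dataclass
class GeneralEvent:
    action: str                  # the verb: "did what"
    who: Optional[str] = None
    whom: Optional[str] = None
    when: Optional[str] = None
    where: Optional[str] = None


@dataclass
class EntityProfile:
    name: str
    aliases: List[str] = field(default_factory=list)
    relations: Dict[str, List[str]] = field(default_factory=dict)
    events: List[GeneralEvent] = field(default_factory=list)

    def add_relation(self, ce: CorrelatedEntity) -> None:
        # Consolidate a CE mention into the profile of its source entity.
        self.relations.setdefault(ce.relation, []).append(ce.target.text)


# Hypothetical example data, invented for illustration only.
person = NamedEntity("John Smith", "PERSON")
company = NamedEntity("Acme Corp", "ORGANIZATION")
profile = EntityProfile(name=person.text)
profile.add_relation(CorrelatedEntity("affiliation", person, company))
profile.events.append(GeneralEvent(action="join", who="John Smith", whom="Acme Corp"))
```

In a CGE, the `who` and `whom` slots would refer to EPs rather than to plain strings; strings are used here only to keep the sketch short.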

The  InfoXtract  engine  has  been  deployed  both 
internally  to  support  Cymfony’s  Brand  Dashboard™ 
product  and  externally  to  a  third-party  integrator  for 
building  IE  applications  in  the  intelligence  domain. 
Figure 1: InfoXtract Engine Architecture


3  Hybrid  Technology 

InfoXtract represents a hybrid model for IE since it
combines grammar formalisms with machine
learning. Achieving the right balance of these
two paradigms is a major design objective of
InfoXtract. The core of the parsing and information
extraction process in InfoXtract is organized very
simply as a pipeline of processing modules. All
modules operate on a single in-memory data structure,
called a token list. A token list is essentially a
sequence of tree structures, overlaid with a graph
whose edges define relations that may be either
grammatical or informational in nature. The nodes of
these trees are called tokens. InfoXtract’s typical
mode of processing is to skim along the roots of the
trees in the token list, building up structure
“strip-wise”. So even non-terminal nodes behave, in
the typical case, as complex tokens. Representing a
marked up text using trees explicitly, rather than
implicitly as an interpretation of paired bracket
symbols, has several advantages. For example, it
allows a somewhat richer organization of the
information contained “between the brackets,”
allowing us to construct direct links from a root node
to its semantic head.

The processing modules that act on token lists can
range from lexical lookup to the application of
hand-written grammars to statistical analysis based on
machine learning all the way to arbitrary procedures
written in C++. The configuration of the InfoXtract
processing pipeline is controlled by a configuration
file, which handles pre-loading required resources as
well as ordering the application of modules. Despite
the variety of implementation strategies available,
InfoXtract Natural Language Processing (NLP)
modules are restricted in what they can do to the token
list to actions of the following three types:

1. Assertion and erasure of token properties
(features, normal forms, etc.)

2. Grouping token sequences into higher level
constituent tokens.

3. Linking token pairs with a relational link.
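
The token list and its three permitted action types can be sketched as follows. This is a minimal illustration under our own assumptions, not the C++ data structure used in the engine: the names `Token`, `TokenList`, `group` and `link`, and the sample sentence, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class Token:
    text: str
    features: Dict[str, str] = field(default_factory=dict)
    children: List["Token"] = field(default_factory=list)
    head: Optional["Token"] = None  # direct link to the semantic head


@dataclass
class TokenList:
    roots: List[Token]  # the sequence of tree roots
    links: List[Tuple[Token, str, Token]] = field(default_factory=list)

    # Action type 1: assert or erase a token property.
    def assert_feature(self, token: Token, name: str, value: str) -> None:
        token.features[name] = value

    def erase_feature(self, token: Token, name: str) -> None:
        token.features.pop(name, None)

    # Action type 2: group a run of root tokens into one constituent token.
    def group(self, start: int, end: int, label: str, head_offset: int) -> Token:
        span = self.roots[start:end]
        parent = Token(
            text=" ".join(t.text for t in span),
            features={"cat": label},
            children=span,
            head=span[head_offset],
        )
        self.roots[start:end] = [parent]
        return parent

    # Action type 3: link a pair of tokens with a relational link.
    def link(self, source: Token, relation: str, target: Token) -> None:
        self.links.append((source, relation, target))


tl = TokenList([Token(w) for w in "the company acquired a startup".split()])
subj = tl.group(0, 2, "NP", head_offset=1)   # "the company"
obj = tl.group(2, 4, "NP", head_offset=1)    # "a startup"
verb = tl.roots[1]
tl.assert_feature(verb, "cat", "VG")
tl.link(verb, "subject", subj)
tl.link(verb, "object", obj)
```

After grouping, the root sequence holds three tokens, so a later module skimming along the roots sees each noun phrase as a single complex token whose semantic head is directly reachable.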

Grammatical analysis of the input text makes use of a
combination of phrase structure and relational
approaches to grammar. Basically, early modules
build up structure to a certain level (including
relatively simple noun phrases, verb groups and
prepositional phrases), after which further
grammatical structure is represented by asserting
relational links between tokens. This mix of phrase
structural and relational approaches is very similar to
the approach of Lexical Functional Grammar (LFG)
[Kaplan & Bresnan 1982], much scaled down.

Our grammars are written in a formalism
developed for our own use, and also in a modified
formalism developed for outside users, based on our
in-house experiences. In both cases, the formalism
mixes regular expressions with boolean expressions.
Actions affecting the token list are implemented as
side effects of pattern matching. So although our
processing modules are in the technical sense token
list transducers, they do not resemble Finite State
Transducers (FSTs) so much as the regular expression
based pattern-action rules used in Awk or Lex.
Grammars can contain (non-recursive) macros, with
parameters.

This means that some long-distance dependencies,
which are very awkward to represent directly in finite
state automata, can be represented very compactly in
macro form. While this has the advantage of
decreasing grammar sizes, it does increase the size of
the resulting automata. Grammars are compiled to a
special type of finite state automata. These token list
automata can be thought of as an extension of tree
walking automata [Monnich et al 2001, Aho &
Ullman 1971, Engelfriet et al 1999]. These are linear
automata (as opposed to standard finite state tree
automata [Gecseg & Steinby 1997], which are more
naturally thought of as parallel) which run over trees.
The problem with linear automata on trees is that there
can be a number of “next” nodes to move the read
head to: right sister, left sister, parent, first child, etc.
So the vocabulary of the automaton is increased to
include not only symbols that might appear in the text
(test instructions) but also symbols that indicate where
to move the read head (directive instructions). We
have extended the basic tree walking formalism in
several directions. First, we extend the power of test
instructions to allow them to check features of the
current node and to perform string matching against
the semantic head of the current node (so that a
syntactically complex constituent can be matched
against a single word). Second, we include symbols
for action instructions, to implement side effects.
Finally, we allow movement not only along the root
sequence (string-automaton style) and branches of a
tree (tree-walking style) but also along the
terminal frontier of the tree and along relational links.
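
The three kinds of instruction (test, directive and action) can be illustrated with a toy interpreter. This is a deliberately simplified sketch: it executes a straight-line instruction sequence rather than a compiled finite state automaton, it supports only four movement directions, and all names (`Node`, `move`, `run`, the feature `has_np_subject`) are our own assumptions rather than part of InfoXtract.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    text: str
    features: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)
    head: Optional["Node"] = None

    def add(self, child: "Node", is_head: bool = False) -> "Node":
        child.parent = self
        self.children.append(child)
        if is_head:
            self.head = child
        return child


def move(node: Node, roots: List[Node], direction: str) -> Optional[Node]:
    """Directive instruction: move the read head, or return None if impossible."""
    siblings = node.parent.children if node.parent else roots
    i = next(k for k, s in enumerate(siblings) if s is node)
    if direction == "right":
        return siblings[i + 1] if i + 1 < len(siblings) else None
    if direction == "left":
        return siblings[i - 1] if i > 0 else None
    if direction == "parent":
        return node.parent
    if direction == "first_child":
        return node.children[0] if node.children else None
    raise ValueError(direction)


Instruction = Tuple[str, object]


def run(program: List[Instruction], roots: List[Node], start: Node) -> bool:
    """Execute test, move and act instructions in order; False on failure."""
    node = start
    for kind, arg in program:
        if kind == "test":
            if not arg(node):
                return False
        elif kind == "move":
            node = move(node, roots, arg)
            if node is None:
                return False
        elif kind == "act":
            arg(node)  # side effect on the token under the read head
    return True


np_node = Node("the large company", {"cat": "NP"})
np_node.add(Node("the"))
np_node.add(Node("large"))
np_node.add(Node("company"), is_head=True)
verb = Node("acquired", {"cat": "VG"})
roots = [np_node, verb]

program: List[Instruction] = [
    ("test", lambda n: n.features.get("cat") == "NP"),       # feature check
    ("test", lambda n: (n.head or n).text == "company"),     # match semantic head
    ("move", "right"),                                       # directive
    ("test", lambda n: n.features.get("cat") == "VG"),
    ("act", lambda n: n.features.update(has_np_subject="yes")),  # action
]
matched = run(program, roots, np_node)
```

The second test instruction matches the whole noun phrase against the single word "company" through its semantic head, which is the kind of extended test described above.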

These extensions to standard tree walking
automata extend the power of that formalism
tremendously, and could pose problems. However, the
grammar formalisms that compile into these token list
walking automata are restrictive, in the sense that
there exist many token list transductions that are
implementable as automata that are not
implementable as grammars. Also the nature of the
shallow parsing task itself is such that we only need to
dip into the reserves of power that this representation
affords us on relatively rare occasions. As a result, the
automata that we actually plug into the InfoXtract
NLP pipeline generally run very fast.

Recently, we have developed an extended finite
state formalism named Expert Lexicon, following the
general trend of lexicalist approaches to NLP. An
expert lexicon rule combines grammatical
components with proximity-based keyword
matching. All Expert Lexicon entries are indexed,
similar to the case for the finite state tool in INTEX
[Silberztein 2000]. The pattern matching time is
therefore reduced dramatically compared to a
sequential finite state device.

Some unique features of this formalism include: (i)
the flexibility of inserting any number of Expert
Lexicons at any level of the process; (ii) the capability
of proximity checking within a window size as rule
constraints in addition to pattern matching using an
FST call, so that the rule writer can exploit the
combined advantages of both; and (iii) support for the
propagation of semantic tagging results, to
accommodate principles like one sense per discourse.
Expert lexicons are used in customization of lexicons,
named entity glossaries, and alias lists, as well as
concept tagging.
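
The combination of indexed entries, proximity checking within a window, and tag propagation can be sketched as below. This is a simplified illustration under our own assumptions: it covers only the keyword-proximity part of a rule (no grammatical components or FST calls), and the names `LexiconRule`, `ExpertLexicon` and `propagate`, as well as the sample rules and sentence, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LexiconRule:
    trigger: str            # head word under which the entry is indexed
    near: Tuple[str, ...]   # keywords that must occur close to the trigger
    window: int             # proximity window, in tokens on each side
    tag: str                # semantic tag assigned when the rule fires


class ExpertLexicon:
    def __init__(self, rules: List[LexiconRule]) -> None:
        # Index the rules by trigger word so that only the entries for the
        # current word are tried, instead of scanning every rule.
        self.index: Dict[str, List[LexiconRule]] = defaultdict(list)
        for rule in rules:
            self.index[rule.trigger].append(rule)

    def tag(self, tokens: List[str]) -> Dict[int, str]:
        lowered = [t.lower() for t in tokens]
        tags: Dict[int, str] = {}
        for i, word in enumerate(lowered):
            for rule in self.index.get(word, ()):
                lo, hi = max(0, i - rule.window), i + rule.window + 1
                context = lowered[lo:i] + lowered[i + 1:hi]
                if any(keyword in context for keyword in rule.near):
                    tags[i] = rule.tag
                    break
        return tags


def propagate(tokens: List[str], tags: Dict[int, str]) -> Dict[int, str]:
    """One sense per discourse: every occurrence of a word inherits the
    tag of its first tagged occurrence."""
    sense: Dict[str, str] = {}
    for i, tag in sorted(tags.items()):
        sense.setdefault(tokens[i].lower(), tag)
    return {i: sense[t.lower()] for i, t in enumerate(tokens) if t.lower() in sense}


lexicon = ExpertLexicon([
    LexiconRule("bank", ("interest", "loan"), 3, "FINANCE"),
    LexiconRule("bank", ("river",), 3, "RIVER"),
])
tokens = "The bank raised interest rates . Analysts said the bank was cautious".split()
local_tags = lexicon.tag(tokens)
all_tags = propagate(tokens, local_tags)
```

Only the first mention of "bank" has a disambiguating keyword within the window; the second mention receives its tag through propagation.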

Both supervised machine learning and unsupervised
learning are used in InfoXtract. Supervised
learning is used in hybrid modules such as NE [Srihari
et al 2000], NE Normalization [Li et al 2002] and
Co-reference. It is also used in the preprocessing
module for orthographic case restoration of case
insensitive input [Niu et al 2003]. Unsupervised
learning involves acquisition of lexical knowledge
and rules from a raw corpus. The former includes
word clustering, automatic name glossary acquisition
and thesaurus construction. The latter involves
bootstrapped learning of NE and CE rules, similar to
the techniques used in [Riloff 1996]. The results of
unsupervised learning can be post-edited and added as
additional resources for InfoXtract processing.


Table 1: SVO/CE Benchmarking

             SVO       CE
CORRECT      196       48
INCORRECT    13        0
SPURIOUS     10        2
MISSING      31        10
PRECISION    89.50%    96.0%
RECALL       81.67%    82.8%
F-MEASURE    85.41%    88.9%

Accuracy 

InfoXtract has been benchmarked using the MUC-7
data sets, which are recognized as standards by the
research community. Precision and recall figures for
the person and location entity types were above 90%.
For organization entity types, precision and recall
were in the high 80s, reflecting the fact that
organization names tend to be very domain specific.
InfoXtract provides the ability to create customized
named entity glossaries, which will boost the
performance of organization tagging for a given
domain. No such customization was done in the
testing just described. The accuracy of shallow
parsing is well over 90%, reflecting very high
performance part-of-speech tagging and named entity
tagging. Table 1 shows the benchmarks for CE
relationships, which are the basis for EPs, and for the
SVO parsing which supports event extraction.
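
The precision, recall and F-measure rows of Table 1 can be recomputed from the four count rows. The sketch below assumes MUC-style definitions, in which incorrect responses count against both precision and recall and the F-measure weights the two equally; the function name `muc_scores` is our own.

```python
from typing import Tuple


def muc_scores(correct: int, incorrect: int, spurious: int, missing: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) from MUC-style counts."""
    precision = correct / (correct + incorrect + spurious)
    recall = correct / (correct + incorrect + missing)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts taken from Table 1.
svo = muc_scores(correct=196, incorrect=13, spurious=10, missing=31)
ce = muc_scores(correct=48, incorrect=0, spurious=2, missing=10)

for name, (p, r, f) in (("SVO", svo), ("CE", ce)):
    print(f"{name}: precision={p:.2%} recall={r:.2%} F={f:.2%}")
```

Under these definitions the precision and recall values match the table after rounding, and the F-measures agree with the reported figures to within about 0.01 percentage points.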

4  Engineering  Architecture 

The InfoXtract engine has been developed as a
modular, distributed application and is capable of
processing up to 20 MB per hour on a single
processor. The system has been tested on very large (>
1 million) document collections. The architecture
facilitates the incorporation of the engine into external
applications requiring an IE subsystem. Requests to
process documents can be submitted through a web
interface, or via FTP. The results of processing a
document can be returned in XML. Since various
tools are available to automatically populate databases
based on XML data models, the results are easily
usable in web-enabled database applications.
Configuration files enable the system to be used with
different lexical/statistical/grammar resources, as well
as with subsets of the available IE modules.

InfoXtract supports two modes of operation, active
and passive. It can act as an active retriever of
documents to process or act as a passive receiver of
documents to process. When in active mode,
InfoXtract is capable of retrieving documents via
HTTP, FTP, or local file system. When in passive
mode, InfoXtract is capable of accepting documents
via HTTP. Figure 2 illustrates a multiple processor
configuration of InfoXtract focusing on the typical
deployment of InfoXtract within an application.


Figure  2:  High  Level  Architecture 


The architecture facilitates scalability by
supporting multiple, independent Processors. The
Processors can be running on a single server (if
multiple CPUs are available) or on multiple servers.
The Document Manager distributes requests to
process documents to all available Processors. Each
component is an independent application. All direct
inter-module communication is accomplished using
the Common Object Request Broker Architecture
(CORBA). CORBA provides a robust, programming
language independent, and platform neutral
mechanism for developing and deploying distributed
applications. Processors can be added and removed
without stopping the InfoXtract engine. All modules
are self-registering and will announce their presence
to other modules once they have completed
initialization.

The Document Retriever module is only used in
the active retriever mode. It is responsible for
retrieving documents from a content provider and
storing the documents for use by the InfoXtract
Controller. The Document Retriever handles all
interfacing with the content provider’s retrieval
process, including interface protocol (authentication,
retrieve requests, etc.), throughput management, and
document packaging. It has been tested to retrieve
documents from content providers such as Northern
Light, Factiva, and LexisNexis. Since the Document
Retriever and the InfoXtract Controller do not
communicate directly, it is possible to run the
Document Retriever standalone and process all
retrieved documents in a batch mode at a later time.

The InfoXtract Controller module is used only in
the active retriever mode. It is responsible for
retrieving documents to be processed, submitting
documents for processing, storing extracted
information, and system logging. The InfoXtract
Controller is a multi-threaded application that is
capable of submitting multiple simultaneous requests
to the Document Manager. As processing results are
returned, they are stored to a repository or database, an
XML file, or both.

The Document Manager module is responsible for
managing document submission to available
Processors. As Processors are initialized, they register
with the Document Manager. The Document Manager
uses a round robin scheduling algorithm for sending
documents to available Processors. A document queue
is maintained with a size of four documents per
Processor. The Processor module forms the core of the
IE engine. InfoXtract utilizes a multi-level approach
to NLP. Each level utilizes the results of the previous
levels in order to achieve more sophisticated parsing.
The JIX module is a web application that is
responsible for accepting requests for documents to be
processed. This module is only used in the passive
mode. The document requests are received via the
HTTP Post request. Processing results are returned in
XML format via the HTTP Post response.
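
The registration and round robin behaviour of the Document Manager can be sketched as follows. This is an in-process illustration only: the real modules communicate over CORBA, and the class and method names here are hypothetical. The sketch also assumes one bounded queue per Processor, which is one possible reading of "four documents per Processor".

```python
from collections import deque
from typing import Deque, Dict, List, Optional

QUEUE_SLOTS_PER_PROCESSOR = 4  # queue size stated in the text


class DocumentManager:
    def __init__(self) -> None:
        self.processors: List[str] = []
        self.queues: Dict[str, Deque[str]] = {}
        self._next = 0  # round robin pointer

    def register(self, name: str) -> None:
        """Called by a Processor once it has finished initializing."""
        self.processors.append(name)
        self.queues[name] = deque()

    def submit(self, doc_id: str) -> Optional[str]:
        """Queue a document on the next Processor with a free slot.

        Returns the chosen Processor name, or None if every queue is full.
        """
        count = len(self.processors)
        for offset in range(count):
            index = (self._next + offset) % count
            name = self.processors[index]
            if len(self.queues[name]) < QUEUE_SLOTS_PER_PROCESSOR:
                self.queues[name].append(doc_id)
                self._next = (index + 1) % count
                return name
        return None

    def complete(self, name: str) -> str:
        """A Processor reports that it has finished its oldest document."""
        return self.queues[name].popleft()


manager = DocumentManager()
manager.register("P1")
manager.register("P2")
assignments = [manager.submit(f"doc{i}") for i in range(9)]
finished = manager.complete("P1")
retry = manager.submit("doc8")
```

With two Processors, the first eight documents alternate between them, the ninth is refused because both queues are full, and it is accepted once a slot is freed.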

In Table 2 we present an example of the
performance that can be expected based on the
application of all modules within the engine. It should
be noted that considerably faster processing per
processor can be achieved if output is restricted to a
certain IE level, such as named entity tagging only.
The output in this benchmark includes all major tasks
such as NE, EP, parsing and event extraction as well
as XML generation.

This configuration provides throughput of
approximately 12,000 documents (avg. 10KB) per
day. A smaller average document size will increase
the document throughput. Increased throughput can
be achieved by dedicating a CPU for each running
Processor. Each Processor instance requires
approximately 500 MB of RAM to run efficiently.
Processing speed increases linearly with additional
Processors/CPUs, and CPU speed. In the current state,
with no speed optimization, using a bank of eight
processors, it is able to process approximately
100,000 documents per day. Thus, InfoXtract is
suitable for high volume deployments. The use of
CORBA provides seamless inter-process and
over-the-wire communication between modules.
Computing resources can be dynamically assigned to
handle increases in document volume.


Table 2: Benchmark for Efficiency

Server Configuration       2 CPU @ 1 GHz, 2 GB RAM
Operating System           Redhat Linux 7.2
Document Collection Size   500 Documents, 5 MB total size
Engine Configuration       InfoXtract Controller, Document Manager,
                           and 2 Processors running on a single server
Processing Time            30 Minutes
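
As a rough cross-check, the Table 2 benchmark can be converted into daily throughput. The calculation below assumes sustained processing at the benchmark rate and the linear scaling across Processors described in the text. Reading the quoted figure of approximately 12,000 documents per day as a per-Processor rate is our interpretation, not something the source states explicitly.

```python
MINUTES_PER_DAY = 24 * 60

# Benchmark figures from Table 2.
documents = 500
total_mb = 5
minutes = 30
processors = 2

# Both Processors together, extrapolated to a full day.
docs_per_day = documents * MINUTES_PER_DAY / minutes
# Assumed linear scaling: divide by the number of Processors.
docs_per_day_per_processor = docs_per_day / processors
# Extrapolation to a bank of eight Processors.
docs_per_day_eight = docs_per_day_per_processor * 8
# Data rate per Processor in MB per hour.
mb_per_hour_per_processor = total_mb * 60 / minutes / processors

print(docs_per_day, docs_per_day_per_processor, docs_per_day_eight, mb_per_hour_per_processor)
```

On this reading, eight Processors would handle about 96,000 documents per day, which is consistent with the figure of approximately 100,000 quoted for a bank of eight processors.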

A standard document input model is used to
develop effective preprocessing capabilities.
Preprocessing adapts the engine to the source by
presenting metadata and zoning information in a
standardized format and performing restoration tasks
(e.g. case restoration). Efforts are underway to
configure the engine such that zone-specific
processing controls are enabled. For example, zones
identified as titles or subtitles must be tagged using
different criteria than running text. The engine has
been deployed on a variety of input formats including
HUMINT documents (all uppercase), the Foreign
Broadcast Information Services feed (FBIS), live
feeds from content providers such as Factiva (Dow
Jones/Reuters), LexisNexis, as well as web pages. A
user-trainable, high-performance case restoration
module [Niu et al 2003] has been developed that
transforms case insensitive input such as speech
transcripts into mixed-case before being processed by
the engine. The case restoration module eliminates the
need for separate IE engines for case-insensitive and
case-sensitive documents; this is easier and more cost
effective to maintain.


5  Corpus-level  IE 


Efforts have extended IE from the document level to
the corpus level. Although most IE systems perform
corpus-level information consolidation at an
application level, it is felt that much can be gained by
doing this as an extended step in the IE engine. A
repository has been developed for InfoXtract that is
able to hold the results of processing an entire corpus.
A proprietary indexing scheme for indexing token-list
data has been developed that supports both queries
over linguistic structures and statistical
similarity queries (e.g., the similarity between two
documents or two entity profiles). The repository is
used by a fusion module in order to generate
cross-document entity profiles as well as for text
mining operations. The results of the repository
module can be subsequently fed into a relational
database to support applications. This has the
advantage of filtering much of the noise from the
engine level and doing sophisticated information
consolidation before populating a relational database.
The architecture of these subsequent stages is shown
in Figure 3.

Figure 3: Extensions to InfoXtract


Information Extraction has two anchor points: (i)
entity-centric information which leads to an EP, and
(ii) action-centric information which leads to an event
scenario. Compared with the consolidation of
extracted events into a cross-document event scenario,
cross-document EP merging and consolidation is a
more tangible task, based mainly on resolving aliases.
Even with modest recall, the corpus-level EP
demonstrates tremendous value in collecting
information about an entity. This is shown in Table
3 for only part of the profile of ‘Mohamed Atta’ from
one experiment based on a collection of news articles.
The extracted EP centralizes a significant amount of
valuable information about this terrorist.
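
Alias-based merging of document-level profiles into a corpus-level EP can be sketched as follows. This is a minimal illustration, not the fusion module itself: the alias table is supplied by hand here, whereas the engine derives aliases from co-reference and alias association. The function name and dictionary layout are our own, and the split of attribute values between the two sample documents is invented, using values that appear in Table 3.

```python
from collections import defaultdict
from typing import Dict, List


def merge_profiles(doc_profiles: List[dict], alias_table: Dict[str, str]) -> Dict[str, dict]:
    """Merge document-level profiles whose names resolve to the same entity."""
    merged: Dict[str, dict] = {}
    for profile in doc_profiles:
        canonical = alias_table.get(profile["name"], profile["name"])
        target = merged.setdefault(
            canonical,
            {"name": canonical, "aliases": set(), "attributes": defaultdict(set)},
        )
        if profile["name"] != canonical:
            target["aliases"].add(profile["name"])
        for slot, values in profile["attributes"].items():
            target["attributes"][slot].update(values)  # set union removes duplicates
    return merged


# Hypothetical document-level profiles for illustration.
doc_profiles = [
    {"name": "Mohamed Atta",
     "attributes": {"Position": ["ring leader"], "Age": ["33"]}},
    {"name": "Atta",
     "attributes": {"Position": ["engineer", "ring leader"], "Where-from": ["Hamburg"]}},
]
alias_table = {"Atta": "Mohamed Atta"}

corpus_profiles = merge_profiles(doc_profiles, alias_table)
atta = corpus_profiles["Mohamed Atta"]
```

Each slot accumulates the distinct values seen across documents, which is why a corpus-level profile such as the one in Table 3 can list several values for Position or Age.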

6  Domain  Porting 

Considerable efforts have been made to keep the core
engine as domain independent as possible; domain
specialization or tuning happens with minimum
change to the core engine, assisted by automatic or
semi-automatic domain porting tools we have
developed.

Cymfony has taken several distinct approaches in
achieving domain portability: (i) the use of a standard
document input model, pre-processors and
configuration scripts in order to tailor input and output
formats for a given application, (ii) the use of tools in
order to customize lexicons and grammars, and (iii)
unsupervised machine learning techniques for
learning new named entities (e.g. weapons) and
relationships based on sample seeds provided by a
user.


Table 3: Sample Entity Profile

Name             Mohamed Atta
Aliases          Atta; Mohamed
Position         apparent mastermind; ring leader; engineer; leader
Age              33; 29; 33-year-old; 34-year-old
Where-from       United Arab Emirates; Spain; Hamburg; Egyptian;
Modifiers        on the first plane; evasive; ready; in Spain;
                 in seat 8D...
Descriptors      hijacker; al-Amir; purported ringleader;
                 a square-jawed 33-year-old pilot; ...
Association      bin Laden; Abdulaziz Alomari; Hani Hanjour;
                 Madrid; American Media Inc.; ...
Involved-events  move-events (2); accuse-events (9);
                 convict-events (10); confess-events (2);
                 arrest-events (3); rent-events (3); ...

It has been one of Cymfony’s primary objectives
to facilitate domain portability [Srihari 1998] [Li &
Srihari 2000a,b, 2003]. This has resulted in a
development/customization environment known as
the Lexicon Grammar Development Environment
(LGDE). The LGDE permits users to modify named
entity glossaries, alias lexicons and general-purpose
lexicons. It also supports example-based grammar
writing; users can find events of interest in sample
documents, process these through InfoXtract and
modify the constraints in the automatically generated
rule templates for event detection. With some basic
training, users can easily use the LGDE to customize
InfoXtract for their applications. This facilitates
customization of the system in user applications
where access to the input data to InfoXtract is
restricted.


7  Applications 

The InfoXtract engine has been used in two
applications, the Information Discovery Portal (IDP)
and Brand Dashboard (www.branddashboard.com).
The IDP supports both the traditional top-down
methods of browsing through large volumes of
information as well as novel, data-driven browsing. A
sample user interface is shown in Figure 4.

Users may select “watch lists” of entities (people,
organizations, targets, etc.) that they are interested in
monitoring. Users may also customize the sources of
information they are interested in processing.
Top-down methods include topic-centric browsing
whereby documents are classified by topics of
interest. IE-based browsing techniques include
entity-centric and event-centric browsing.
Entity-centric browsing permits users to track key
entities (people, organizations, targets) of interest and
monitor information pertaining to them. Event-centric
browsing focuses on significant actions including
money movement and people movement events.
Visualization of extracted information is a key
component of the IDP. The Information Mesh enables
a user to visualize an entity, its attributes and its
relation to other entities and events. Starting from an
entity (or event), relationship chains can be traversed
to explore related items. Timelines facilitate
visualization of information in the temporal axis.
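
Traversing relationship chains from a starting entity or event can be sketched as a bounded breadth-first search over a labeled graph. The sketch below assumes the extracted relations are available as an adjacency mapping; the function `related_items`, the relation labels and the node names are hypothetical and are not the Information Mesh implementation.

```python
from collections import deque
from typing import Dict, List, Tuple

# node -> list of (relation label, neighboring node)
Mesh = Dict[str, List[Tuple[str, str]]]


def related_items(mesh: Mesh, start: str,
                  max_hops: int) -> Dict[str, List[Tuple[str, str]]]:
    """Return each item reachable within max_hops, with the shortest
    relationship chain (a list of (relation, node) steps) leading to it."""
    chains: Dict[str, List[Tuple[str, str]]] = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if len(chains[node]) == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbor in mesh.get(node, []):
            if neighbor not in chains:
                chains[neighbor] = chains[node] + [(relation, neighbor)]
                queue.append(neighbor)
    del chains[start]
    return chains


# Illustrative mesh of entities and events.
mesh = {
    "Person A": [("involved-in", "move-event 1"),
                 ("associated-with", "Person B")],
    "move-event 1": [("destination", "Madrid")],
    "Person B": [("involved-in", "arrest-event 7")],
}
one_hop = related_items(mesh, "Person A", 1)
two_hops = related_items(mesh, "Person A", 2)
```

Returning the chain along with each related item lets an interface show why two items are connected, not merely that they are.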


Figure  4:  Information  Discovery  Portal 


Recent efforts have included a tight integration of
InfoXtract with visualization tools such as the
Web-based Timeline Analysis System (WebTAS)
(http://www.webtas.com). The IDP lets users select
events of interest and automatically export them to
WebTAS for visualization. Efforts are underway to
integrate higher-level event scenario analysis tools
such as the Terrorist Modus Operandi Detection
System (TMODS) (www.21technologies.com) into
the IDP.

Brand Dashboard is a commercial application for
marketing and public relations organizations to
measure and assess media perception of consumer
brands. The InfoXtract engine is used to analyze
several thousand electronic sources of information
provided by various content aggregators (Factiva,
LexisNexis, etc.). The engine is focused on tagging
and generating brand profiles that also capture salient
information such as the descriptive phrases used in
describing brands (e.g. cost-saving, non-habit
forming) as well as user-configurable specific
messages that companies are trying to promote and
track (safe and reliable, industry leader, etc.). The
output from the engine is fed into a database-driven
web application which then produces report cards for
brands containing quantitative metrics pertaining to
brand perception, as well as qualitative information
describing characteristics. A sample screenshot from
Brand Dashboard is presented in Figure 5. It depicts a
report card for a particular brand, highlighting brand
strength as well as the metrics that have changed the
most in the last time period. The “buzz box” on the
right-hand side illustrates companies/brands, people,
analysts, and messages most frequently associated
with the brand in question.
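
One way a report card could identify the metric that has changed the most is to compare the current period with a trailing average. The sketch below assumes weekly values and a 13-week baseline; the function `greatest_metric_change`, the metric names and the numbers are hypothetical and are not taken from Brand Dashboard.

```python
from typing import Dict, List, Optional, Tuple


def greatest_metric_change(history: Dict[str, List[float]],
                           window: int = 13) -> Optional[Tuple[str, float]]:
    """Return (metric, percent change) for the metric whose latest value
    deviates most from the average of the preceding `window` values.

    `history` maps a metric name to weekly values, oldest first; the last
    value is the current week. Metrics without a usable baseline are skipped.
    """
    best: Optional[Tuple[str, float]] = None
    for metric, values in history.items():
        if len(values) < 2:
            continue
        current = values[-1]
        previous = values[-1 - window:-1]   # up to `window` earlier weeks
        baseline = sum(previous) / len(previous)
        if baseline == 0:
            continue                        # percent change is undefined
        change = 100.0 * (current - baseline) / baseline
        if best is None or abs(change) > abs(best[1]):
            best = (metric, change)
    return best


# Illustrative data: thirteen earlier weeks followed by the current week.
history = {
    "clippings": [10.0] * 13 + [5.0],
    "messages": [4.0] * 13 + [5.0],
}
result = greatest_metric_change(history)
```

Comparing absolute percentage changes lets a sharp drop rank above a smaller rise, which suits a report card that flags changes in either direction.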


[Figure 5 panels: Overall Brand Strength; Greatest Metric Change (this week versus the 13-week average); Top Sources]

Figure  5:  Report  Card  from  Brand  Dashboard 


8  Summary  and  Future  Work 

This paper has described the motivation behind
InfoXtract, a domain-independent, portable,
intermediate-level IE engine. It has also discussed the
architecture of the engine, from both an algorithmic
and a software engineering perspective.
Current efforts to improve InfoXtract include the
following: support for more diverse input formats,
more use of metadata in the extraction tasks, support
for structured data, and capabilities for processing
foreign languages. Finally, support for more intuitive
domain customization tools, especially the
semi-automatic learning tools, is a major focus.

Acknowledgments 

The  authors  wish  to  thank  Carrie  Pine  of  AFRL  for 
reviewing  and  supporting  this  work. 

References 

[Aho & Ullman 1971] Alfred V. Aho and Jeffrey
D. Ullman. Translations on a context-free grammar.
Information and Control, 19(5):439-475, 1971.

[Aone & Ramos-Santacruz 1998] REES: A
Large-Scale Relation and Event Extraction System,
url: http://acl.ldc.upenn.edu/A/A00/A00-1011.pdf

[Chinchor & Marsh 1998] Chinchor, N. & Marsh,
E. 1998. MUC-7 Information Extraction Task
Definition (version 5.1), Proceedings of MUC-7.

[Cunningham et al 2003] Hamish Cunningham et
al. Developing Language Processing Components
with GATE: A User Guide.
http://gate.ac.uk/sale/tao/index.html#annie

[Engelfriet et al 1999] Joost Engelfriet, Hendrik
Jan Hoogeboom, and Jan-Pascal Van Best. Trips on
trees. Acta Cybernetica, 14(1):51-64, 1999.

[Gecseg & Steinby 1997] Ferenc Gecseg and
Magnus Steinby. Tree languages. In Grzegorz
Rozenberg and Arto Salomaa, editors, Handbook of
Formal Languages: Beyond Words, volume 3, pages
1-68, Berlin, 1997. Springer.

[Han  1999]  Han,  J.  Data  Mining.  1999.  In  J. 
Urban  and  P.  Dasgupta  (eds.).  Encyclopedia  of 
Distributed  Computing,  Kluwer  Academic  Publishers. 

[Hobbs 1993] J. R. Hobbs, 1993. FASTUS: A
System for Extracting Information from Text,
Proceedings of the DARPA workshop on Human
Language Technology, Princeton, NJ, 133-137.

[Kaplan & Bresnan 1982] Ronald M. Kaplan and
Joan Bresnan. Lexical-Functional Grammar: A formal
system for grammatical representation. In Joan
Bresnan, editor, The Mental Representation of
Grammatical Relations, pages 173-281. The MIT
Press, Cambridge, MA, 1982.

[Krupka & Hausman 1998] G. R. Krupka and K.
Hausman, “IsoQuest Inc: Description of the NetOwl
Text Extraction System as used for MUC-7”, MUC-7.

[Li et al 2002] Li, H., R. Srihari, C. Niu, and W. Li
(2002). Localization Normalization for Information
Extraction. COLING 2002, 549-555, Taipei, Taiwan.

[Li, W & R. Srihari 2000a]. A Domain
Independent Event Extraction Toolkit, Final
Technical Report, Air Force Research Laboratory,
Information Directorate, Rome Research Site, New
York

[Li, W & R. Srihari 2000b]. Flexible Information
Extraction Learning Algorithm, Final Technical
Report, Air Force Research Laboratory, Information
Directorate, Rome Research Site, New York

[Li & Srihari 2003] Li, W. and R. K. Srihari (2003)
Intermediate-Level Event Extraction for Temporal
and Spatial Analysis and Visualization, Final
Technical Report AFRL-IF-RS-TR-2002-245, Air
Force Research Laboratory, Information Directorate,
Rome Research Site, New York.

[Miller et al 1998] Miller, Scott; Crystal, Michael;
Fox, Heidi; Ramshaw, Lance; Schwartz, Richard;
Stone, Rebecca; Weischedel, Ralph; and the Annotation
Group. 1998. Algorithms that Learn to Extract
Information; BBN: Description of the SIFT System as
Used for MUC-7.

[Monnich et al 2001] Uwe Monnich, Frank
Morawietz, and Stephan Kepser. A regular query for
context-sensitive relations. In Steven Bird, Peter
Buneman, and Mark Liberman, editors, IRCS
Workshop Linguistic Databases 2001, pages
187-195, 2001.

[Niu et al 2003] Niu, C., W. Li, J. Ding, and R.K.
Srihari (to appear 2003). Orthographic Case
Restoration Using Supervised Learning Without
Manual Annotation. Proceedings of The 16th
FLAIRS, St. Augustine, FL

[Riloff 1996] Automatically Generating
Extraction Patterns from Untagged Text. AAAI-96.

[Roche & Schabes 1997] Emmanuel Roche &
Yves Schabes, 1997. Finite-State Language
Processing, The MIT Press, Cambridge, MA.

[Silberztein 1999] Max Silberztein (1999).
INTEX: a Finite State Transducer toolbox, in
Theoretical Computer Science #231:1, Elsevier
Science

[Srihari 1998]. A Domain Independent Event
Extraction Toolkit, AFRL-IF-RS-TR-1998-152 Final
Technical Report, Air Force Research Laboratory,
Information Directorate, Rome Research Site, New
York

[Srihari et al 2000] Srihari, R., C. Niu and W. Li
(2000). A Hybrid Approach for Named Entity and
Sub-Type Tagging. In Proceedings of ANLP 2000,
247-254, Seattle, WA.


